# Converting all kinds of documents into text

Have a collection of documents? Word docs, HTML files, PDFs, _image-based_ PDFs, and anything else? Don't worry, Apache Tika has you covered. 


## Installation

These installation instructions use Homebrew, so they only work on macOS, but it's possible to get the same software running on Windows.

### Tesseract

[Tesseract](https://github.com/tesseract-ocr/tesseract) is a piece of software that performs OCR, converting images of text into actual text. If we need to perform OCR on more languages than just English, we'll also need to install `tesseract-lang` to add more languages to the mix.
    
```
brew install tesseract tesseract-lang
```

### Tika

[Tika](https://tika.apache.org/) is an incredible piece of software that converts just about any kind of document to text. It requires Java. I installed Java from https://www.java.com/en/download/ and it didn't work, so use the install commands below instead.

```
brew cask install adoptopenjdk
brew install tika 
```

Tika will automatically know about tesseract.

### Python bindings for Tika

Tika is a piece of software that exists _outside of Python_. If we want Python to be able to use Tika, we'll need to install the **Python bindings** for Tika.

```
pip install tika
```

If you'd like to just run this all from the notebook, uncomment and run the cell below. **You'll need to type in your password for the `adoptopenjdk` one, so be sure to pay attention to when it asks you.**

In [1]:
# !brew install tesseract tesseract-lang
# !brew cask install adoptopenjdk
# !brew install tika 
# !pip install tika
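
Once the installs finish, it can help to confirm the command-line tools are actually visible before going further. This is a minimal sketch using only the standard library; `find_tools` is a hypothetical helper name, and the choice to check `tesseract` and `java` is an assumption based on the tools installed above.

```python
import shutil

def find_tools(names=("tesseract", "java")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Print one line per tool so a missing install is easy to spot
for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

If either tool shows up as `NOT FOUND`, re-run the matching install command before trying the parsing examples.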

## Documents we'll be using

* `pdf`: https://data.ct.gov/download/fxjv-82m6/application/pdf
* `doc`: https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx
* `jpg` for OCR: https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg

## Confirm tesseract works

In [1]:
# Download the image
!curl -O https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  822k  100  822k    0     0  3259k      0 --:--:-- --:--:-- --:--:-- 3264k


![](https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg)

In [3]:
!tesseract Dr._Jekyll_and_Mr._Hyde_Text.jpg stdout

at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

## Using Tika

### Starting it up

In [4]:
import tika
import requests
from tika import parser

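# In current versions of the tika package this call does nothing and is kept
# for backward compatibility; the Tika server starts on the first parse call
tika.initVM()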

### Doing your parsing

There are two ways to do it!

**Right from the web**

```python
response = requests.get(...)
results = parser.from_buffer(response.content)
```

**From a downloaded file**

```python
results = parser.from_file(filename)
```
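
If you'd rather not remember which of the two calls to use, the choice can be wrapped in one function. This is only a sketch: `is_url` and `parse_document` are hypothetical helper names, and the rule "anything starting with `http://` or `https://` is a web document" is an assumption that suits the examples here.

```python
from urllib.parse import urlparse

def is_url(source):
    """Return True when source looks like an http(s) URL rather than a local path."""
    return urlparse(str(source)).scheme in ("http", "https")

def parse_document(source):
    """Parse a URL or a local file with Tika and return its results dict."""
    # Imported here so is_url() stays usable without Tika installed
    import requests
    from tika import parser

    if is_url(source):
        response = requests.get(source)
        response.raise_for_status()  # stop early on a failed download
        return parser.from_buffer(response.content)
    return parser.from_file(str(source))
```

With this in place, `parse_document('https://data.ct.gov/download/fxjv-82m6/application/pdf')` and `parse_document('Dr._Jekyll_and_Mr._Hyde_Text.jpg')` would both return the same kind of results dictionary.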

Note that **non-English OCR** needs an extra step: you pass the language to Tika in a header. The example below is for Ancient Greek (`grc`). See which languages your tesseract supports with `tesseract --list-langs`.

```python
headers = {
    "X-Tika-OCRLanguage": "grc"
}

results = parser.from_buffer(response.content, headers=headers)
```
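
The `tesseract --list-langs` check can also be done from Python before sending an OCR request. The sketch below is an assumption-laden helper, not part of Tika: `parse_lang_list` and `installed_languages` are hypothetical names, and the parsing assumes the usual output of a header line starting with "List of" followed by one language code per line.

```python
import subprocess

def parse_lang_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip the header line, e.g. "List of available languages (3):"
    return [line for line in lines if line and not line.startswith("List of")]

def installed_languages():
    """Ask the local tesseract which languages it has installed."""
    completed = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some tesseract versions may print the list to stderr, so read both streams
    return parse_lang_list(completed.stdout + "\n" + completed.stderr)
```

A quick `"grc" in installed_languages()` before parsing tells you whether the header above has any chance of working on your machine.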

## Examples

### PDF example

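The first call will be very slow, because the `tika` Python package downloads the Tika server (a Java `.jar` file) and starts it in the background. Later calls reuse the running server and are much faster.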

In [6]:
response = requests.get('https://data.ct.gov/download/fxjv-82m6/application/pdf')
results = parser.from_buffer(response.content)

In [7]:
results.keys()

dict_keys(['status', 'content', 'metadata'])

In [8]:
results['status']

200
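
Since every result has the same three keys, the status check and the text lookup can be folded into one small function. This is a sketch rather than anything the `tika` package provides: `get_text` is a hypothetical name, and treating any status other than 200 as a failure is an assumption.

```python
def get_text(results):
    """Return the extracted text from a Tika results dict, or raise if parsing failed."""
    status = results.get("status")
    if status != 200:
        raise RuntimeError(f"Tika returned status {status}")
    # 'content' can be None when Tika finds no text, so fall back to ''
    return (results.get("content") or "").strip()
```

Calling `get_text(results)` instead of `results['content'].strip()` avoids an `AttributeError` on documents where no text was found.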

In [9]:
# Only showing the first 1000 chars because there are SO MANY
results['content'][:1000]

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n  \n\n  \n\n \n\n \n\nConnecticut \n\nOpen Data \n\nPolicy \nEffective April 22, 2015 \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nPromulgated in accordance with and \n\nunder the authority of Executive \n\nOrder 39 of Governor Dannel P. \n\nMalloy \n\n \n\n  \n\n  \n\n \n\n  \n \n\n  \n\n\n\n \n\n \n\nContents \n\n \n\n \n1.0 Definitions .......................................................................................................................... 3 \n\n2.0  Introduction...................................................................................................................... 5 \n\n2.1  Intent ............................................................................................................................ 5 \n\n2.2  Scope ............................................................................................................................ 5 \n\n2.3  Legal Consid

In [10]:
# Only showing the first 10000 chars
print(results['content'][:10000].strip())

Connecticut 

Open Data 

Policy 
Effective April 22, 2015 

 

 

 

 

 

 

 

Promulgated in accordance with and 

under the authority of Executive 

Order 39 of Governor Dannel P. 

Malloy 

 

  

  

 

  
 

  



 

 

Contents 

 

 
1.0 Definitions .......................................................................................................................... 3 

2.0  Introduction...................................................................................................................... 5 

2.1  Intent ............................................................................................................................ 5 

2.2  Scope ............................................................................................................................ 5 

2.3  Legal Considerations ....................................................................................................... 5 

3.0  Open Data Policy Requirements .......................
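
As the output shows, text extracted from a PDF tends to be full of trailing spaces and long runs of blank lines. One possible cleanup, sketched with the standard `re` module (`tidy_whitespace` is a hypothetical helper, and squeezing blank runs down to a single blank line is just one reasonable choice):

```python
import re

def tidy_whitespace(text):
    """Strip trailing spaces and squeeze runs of blank lines down to one."""
    # Drop spaces and tabs at the end of every line
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Three or more newlines in a row become a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Running `print(tidy_whitespace(results['content'])[:1000])` should give a much more compact view of the same document.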

### Word doc example

In [14]:
response = requests.get('https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx')
results = parser.from_buffer(response.content)
print(results['content'].strip())

Highlights 

Evaluating Weathering of Food Packaging Polyethylene-Nano-clay Composites: Release of Nanoparticles and their Impacts

Changseok Han1, Amy Zhao1, and Eunice Varughese2, E. Sahle-Demessie*1




1. UV or O3 degradation food packaging composites released nanoclay particles. 
2. Properties of nanocomposites changed during accelerated weathering.
3. Nanoclay release was proportional to weathering time.
4. Toxicity of released nanoclay at test concentrations were not significant.


### OCR image example

The same approach works with an image-based PDF instead of an image.

In [17]:
response = requests.get('https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg')
results = parser.from_buffer(response.content)
results['status']

200

In [18]:
print(results['content'].strip())

at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, s
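
OCR keeps the page's line breaks, so words split across lines come through as `or-` / `dinary` and `under-` / `stood`. If that gets in the way of searching or counting words, a small regular expression can stitch them back together. This is a rough sketch (`join_hyphenated` is a hypothetical helper), and it assumes every end-of-line hyphen followed by a lowercase letter is a split word, which will also merge genuine compounds such as "well-" / "known".

```python
import re

def join_hyphenated(text):
    """Rejoin words that were split with a hyphen at the end of a line."""
    # A letter, a hyphen, a newline, then a lowercase letter: remove the break
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)
```

For example, `join_hyphenated(results['content'])` turns "my or-" followed by "dinary manner" into "my ordinary manner" on a single line.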

### Using local files

In [20]:
# Save the file locally
!curl -O https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg

results = parser.from_file('Dr._Jekyll_and_Mr._Hyde_Text.jpg')
print(results['content'].strip())

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  822k  100  822k    0     0  2594k      0 --:--:-- --:--:-- --:--:-- 2595k
at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could s